logo

Introduction

The purpose of this project is to gauge your technical skills and problem solving ability by working through something similar to a real NBA data science project. You will work your way through this R Markdown document, answering questions as you go along. Please begin by adding your name to the “author” key in the YAML header. When you’re finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code so make it clear, concise, and avoid long printouts. Feel free to add in as many new code chunks as you’d like.

Remember that we will be grading the quality of your code and visuals alongside the correctness of your answers. Please try to use the tidyverse as much as possible (instead of base R and explicit loops). Please do not bring in any outside data, and use the provided data as truth (for example, some “home” games have been played at secondary locations, including TOR’s entire 2020-21 season. These are not reflected in the data and you do not need to account for this.) Note that the OKC and DEN 2024-25 schedules in schedule_24_partial.csv intentionally include only 80 games, as the league holds 2 games out for each team in the middle of December due to unknown NBA Cup matchups. Do not assign specific games to fill those two slots.

Note:

Throughout this document, any season column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. We may refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).

Answers

Part 1

Question 1: 26 4-in-6 stretches in OKC’s draft schedule.

Question 2: 25.1 4-in-6 stretches on average.

Question 3:

  • Most 4-in-6 stretches on average: CHA (28.1)
  • Fewest 4-in-6 stretches on average: NYK (22.2)

Question 4: This is a written question. Please leave your response in the document under Question 4.

Question 5:

  • BKN Defensive eFG%: 54.5%
  • When opponent on a B2B: 53.6%

Part 2

Please show your work in the document, you don’t need anything here.

Part 3

Question 9:

  • Most Helped by Schedule: LAC (10.6 wins)
  • Most Hurt by Schedule: DET (-15.5 wins)

Setup and Data

library(tidyverse)
# Note, you will likely have to change these paths. If your data is in the same folder as this project, 
# the paths will likely be fixed for you by deleting ../../Data/schedule_project/ from each string.
schedule <- read_csv("schedule.csv")
draft_schedule <- read_csv("schedule_24_partial.csv")
locations <- read_csv("locations.csv")
game_data <- read_csv("team_game_data.csv")

Part 1 – Schedule Analysis

In this section, you’re going to work to answer questions using NBA scheduling data.

Question 1

QUESTION: How many times are the Thunder scheduled to play 4 games in 6 nights in the provided 80-game draft of the 2024-25 season schedule? (Note: clarification, the stretches can overlap, the question is really “How many games are the 4th game played over the past 6 nights?”)

okc_schedule <- draft_schedule %>%
  filter(team == "OKC") %>%                 # keep only OKC, filter out DEN
  mutate(gamedate = as.Date(gamedate)) %>%  # ensure Date class (YYYY-MM-DD)
  arrange(gamedate) %>%                     # sort by date
  mutate(
    # date of the game 3 rows earlier (NA for first 3 rows)
    first_in_4 = lag(gamedate, 3),

    # integer difference in days between current game and the 1st of the 4-game window
    days_between_4th_and_1st = as.integer(gamedate - first_in_4),

    # flag = 1 when the 4th game's date is within 5 days of the 1st (i.e., in a 6-day window)
    fourth_game_flag = if_else(!is.na(days_between_4th_and_1st) & days_between_4th_and_1st <= 5,
                               1L, 0L)
  )

# Count how many times the flag is 1
total_flags <- sum(okc_schedule$fourth_game_flag, na.rm = TRUE)
total_flags
## [1] 26

ANSWER 1:

26 4-in-6 stretches in OKC’s draft schedule.

Question 2

QUESTION: From 2014-15 to 2023-24, what is the average number of 4-in-6 stretches for a team in a season? Adjust each team/season to per-82 games before taking your final average.

full_schedule <- schedule %>%
  mutate(gamedate = as.Date(gamedate))
count_four_in_six <- function(df) {
  df %>%
    arrange(gamedate) %>%
    mutate(first_in_4 = lag(gamedate, 3),
           diff_days = as.integer(gamedate - first_in_4),
           four_in_six_flag = if_else(!is.na(diff_days) & diff_days <= 5, 1L, 0L)) %>%
    summarise(
      games_played = n(),
      four_in_six = sum(four_in_six_flag, na.rm = TRUE)
    )
}

# Apply per team-season
team_season_counts <- schedule %>%
  group_by(season, team) %>%
  group_modify(~ count_four_in_six(.x)) %>%
  ungroup() %>%
  # Scale counts to per-82 games
  mutate(four_in_six_per82 = four_in_six * 82 / games_played)

# Compute final league-wide average across all team-seasons
final_avg <- mean(team_season_counts$four_in_six_per82, na.rm = TRUE)

final_avg
## [1] 25.10331

ANSWER 2:

25.1 4-in-6 stretches on average.

Question 3

QUESTION: Which of the 30 NBA teams has had the highest average number of 4-in-6 stretches between 2014-15 and 2023-24? Which team has had the lowest average? Adjust each team/season to per-82 games.

library(tidyverse)

# Load data
schedule <- read_csv("schedule.csv") %>%
  mutate(gamedate = as.Date(gamedate))
## Rows: 23958 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): team, opponent
## dbl  (3): season, home, win
## date (1): gamedate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Function to count 4-in-6 stretches for one team-season
count_four_in_six <- function(df) {
  df %>%
    arrange(gamedate) %>%
    mutate(first_in_4 = lag(gamedate, 3),
           diff_days = as.integer(gamedate - first_in_4),
           four_in_six_flag = if_else(!is.na(diff_days) & diff_days <= 5, 1L, 0L)) %>%
    summarise(
      games_played = n(),
      four_in_six = sum(four_in_six_flag, na.rm = TRUE)
    )
}

# Apply per team-season
team_season_counts <- schedule %>%
  group_by(season, team) %>%
  group_modify(~ count_four_in_six(.x)) %>%
  ungroup() %>%
  # Scale to per-82 games
  mutate(four_in_six_per82 = four_in_six * 82 / games_played)

# Average per team across 2014-15 to 2023-24
team_avgs <- team_season_counts %>%
  group_by(team) %>%
  summarise(avg_four_in_six_per82 = mean(four_in_six_per82, na.rm = TRUE)) %>%
  arrange(desc(avg_four_in_six_per82))

# Find highest and lowest
highest <- team_avgs %>% slice(1)
lowest  <- team_avgs %>% slice(n())

highest
## # A tibble: 1 × 2
##   team  avg_four_in_six_per82
##   <chr>                 <dbl>
## 1 CHA                    28.1
lowest
## # A tibble: 1 × 2
##   team  avg_four_in_six_per82
##   <chr>                 <dbl>
## 1 NYK                    22.2

ANSWER 3:

  • Most 4-in-6 stretches on average: CHA (28.1)
  • Fewest 4-in-6 stretches on average: NYK (22.2)

Question 4

QUESTION: Is the difference between most and least from Q3 surprising, or do you expect that size difference is likely to be the result of chance?

ANSWER 4:

I would not say that the difference is surprising, but I definitely do not think it is chance. Given that the two teams are at opposite ends of the spectrum in terms of market size, it makes sense that a team like the Knicks would be given a lighter load. A team like the Knicks is going to play in many more primetime slot games than the Hornets, so the NBA would have an incentive to make sure they give their players the best chance at being healthy and rested. Having Knicks’ stars sit out due to a heavy load likely has a greater negative impact on the NBA than Hornets’ stars doing the same.

Question 5

QUESTION: What was BKN’s defensive eFG% in the 2023-24 season? What was their defensive eFG% that season in situations where their opponent was on the second night of back-to-back?

library(dplyr)

# Make sure gamedate is Date type
game_data <- game_data %>%
  mutate(gamedate = as.Date(gamedate))

# Step 1: Filter for Brooklyn defense in 2023-24
bkn_def <- game_data %>%
  filter(season == 2023, def_team == "BKN") %>%
  mutate(def_efg = (fgmade + 0.5 * fg3made) / fgattempted)

# Step 2: Overall defensive eFG% for the season
bkn_overall <- mean(bkn_def$def_efg, na.rm = TRUE)

# Step 3: Compute opponent rest (across all teams first)
rest_info <- game_data %>%
  arrange(off_team, gamedate) %>%
  group_by(off_team) %>%
  mutate(prev_game = lag(gamedate),
         days_rest = as.integer(gamedate - prev_game),
         b2b_flag = if_else(days_rest == 1, 1L, 0L)) %>%
  ungroup() %>%
  select(off_team, gamedate, b2b_flag)

# Step 4: Join opponent rest info onto Brooklyn defensive games
bkn_def <- bkn_def %>%
  left_join(rest_info, by = c("off_team", "gamedate"))

# Step 5: Defensive eFG% when opponent was on 2nd night of B2B
bkn_b2b <- bkn_def %>%
  filter(b2b_flag == 1) %>%
  summarise(def_efg_b2b = mean(def_efg, na.rm = TRUE)) %>%
  pull(def_efg_b2b)

# Results
bkn_overall   # overall defensive eFG% (2023-24)
## [1] 0.5450564
bkn_b2b       # defensive eFG% vs opponents on 2nd of back-to-back
## [1] 0.5363431

ANSWER 5:

  • BKN Defensive eFG%: 54.5%
  • When opponent on a B2B: 53.6%

Part 3 – Modeling

Question 9

QUESTION: Please estimate how many more/fewer regular season wins each team has had due to schedule-related factors from 2019-20 though 2023-24. Your final answer should have one number for each team, representing the total number of wins (not per 82, and not a per-season average). You may consider the on-court strength of the scheduled opponents as well as the impact of travel/schedule density. Please include the teams and estimates for the most helped and most hurt in the answer key.

If you fit a model to help answer this question, please write a paragraph explaining your model, and include a simple model diagnostic (eg a printed summary of a regression, a variable importance plot, etc).

# --- Packages ---
library(readr)
library(slider)
## Warning: package 'slider' was built under R version 4.4.3
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.4.3
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:plotly':
## 
##     slice
## The following object is masked from 'package:dplyr':
## 
##     slice
# --- Load data ---
games <- read_csv("team_game_data_with_distances.csv", show_col_types = FALSE) %>%
  filter(season %in% 2019:2023, gametype == 2) %>%    # 2019–20 through 2023–24 RS
  mutate(
    gamedate = as.Date(gamedate),
    team = row_team,
    opp  = def_team,
    home = as.integer(row_is_home),
    win  = as.integer(off_win),
    dist = ifelse(is.na(distance_miles), 0, distance_miles)
  ) %>%
  arrange(team, gamedate)

# --- Rest days, B2B, road-trip counter ---
games <- games %>%
  group_by(team, season) %>%
  arrange(gamedate, .by_group = TRUE) %>%
  mutate(
    rest_days = as.numeric(gamedate - lag(gamedate)),
    rest_days = ifelse(is.na(rest_days), NA, pmax(rest_days - 1, 0)),
    b2b = rest_days == 0,
    road_trip_day = {
      rt <- integer(n())
      run <- 0L
      for (i in seq_len(n())) {
        if (home[i] == 0L) { run <- run + 1L; rt[i] <- run } else { run <- 0L; rt[i] <- 0L }
      }
      rt
    }
  ) %>%
  ungroup()

# --- Opponent form (last-20 win%) ---
games <- games %>%
  group_by(team, season) %>%
  arrange(gamedate, .by_group = TRUE) %>%
  mutate(
    team_last20 = slide_dbl(lag(win), ~ mean(.x, na.rm = TRUE), .before = 19, .complete = FALSE)
  ) %>%
  ungroup()

opp_feats <- games %>%
  select(season, gamedate, team, team_last20) %>%
  rename(opp = team, opp_last20 = team_last20)

games <- games %>%
  left_join(opp_feats, by = c("season", "gamedate", "opp"))

# Replace early NAs with league average for that date
games <- games %>%
  group_by(season, gamedate) %>%
  mutate(
    opp_last20 = ifelse(is.na(opp_last20), mean(opp_last20, na.rm = TRUE), opp_last20)
  ) %>%
  ungroup()

# --- Model data ---
model_data <- games %>%
  select(win, home, b2b, road_trip_day, rest_days, dist, opp_last20, season, team) %>%
  filter(complete.cases(.))

# Encode season as numeric for XGBoost
model_data$season <- as.numeric(as.factor(model_data$season))

# --- Prepare matrices ---
X <- model_data %>% select(-win, -team)
y <- model_data$win
feature_names <- colnames(X)

dtrain <- xgb.DMatrix(data = as.matrix(X), label = y)
colnames(dtrain) <- feature_names

# --- Fit XGBoost model ---
set.seed(123)
xgb_model <- xgboost(
  data = dtrain,
  objective = "binary:logistic",
  eval_metric = "logloss",
  nrounds = 300,
  max_depth = 5,
  eta = 0.05,
  subsample = 0.8,
  colsample_bytree = 0.8,
  verbose = 0
)

# --- Predictions (actual schedule) ---
model_data$p_hat_actual <- predict(xgb_model, newdata = dtrain)

# --- Counterfactual A: Opponent-neutralized ---
X_cf_opp <- model_data %>%
  mutate(opp_last20 = 0.50) %>%       # all opponents = average strength
  select(all_of(feature_names))

dtest_cf_opp <- xgb.DMatrix(data = as.matrix(X_cf_opp))
colnames(dtest_cf_opp) <- feature_names

model_data$p_hat_cf_opp <- predict(xgb_model, newdata = dtest_cf_opp)

# --- Counterfactual B: Schedule-neutralized ---
global_medians <- model_data %>%
  summarise(
    med_rest = median(rest_days, na.rm = TRUE),
    med_dist = median(dist, na.rm = TRUE)
  )

X_cf_sched <- model_data %>%
  mutate(
    b2b = 0,
    rest_days = global_medians$med_rest,
    road_trip_day = 1,
    dist = global_medians$med_dist
  ) %>%
  select(all_of(feature_names))

dtest_cf_sched <- xgb.DMatrix(data = as.matrix(X_cf_sched))
colnames(dtest_cf_sched) <- feature_names

model_data$p_hat_cf_sched <- predict(xgb_model, newdata = dtest_cf_sched)

# --- Compute effects ---
model_data <- model_data %>%
  mutate(
    opp_effect   = p_hat_actual - p_hat_cf_opp,
    sched_effect = p_hat_actual - p_hat_cf_sched,
    total_effect = opp_effect + sched_effect
  )

# --- Aggregate to TOTAL WINS gained/lost per team ---
team_schedule_effects <- model_data %>%
  group_by(team) %>%
  summarise(
    wins_from_schedule = sum(total_effect, na.rm = TRUE),
    total_games = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(wins_from_schedule))

cat("\n=== TOTAL Wins gained (+) / lost (–) from schedule (Opponent + Grind effects), 2019–20 through 2023–24 ===\n")
## 
## === TOTAL Wins gained (+) / lost (–) from schedule (Opponent + Grind effects), 2019–20 through 2023–24 ===
print(team_schedule_effects, n = nrow(team_schedule_effects))
## # A tibble: 30 × 3
##    team  wins_from_schedule total_games
##    <chr>              <dbl>       <int>
##  1 LAC               10.6           385
##  2 BOS                8.37          385
##  3 LAL                6.44          384
##  4 GSW                3.84          378
##  5 SAC                3.51          385
##  6 MIA                3.40          386
##  7 DAL                3.20          388
##  8 PHI                3.13          386
##  9 TOR                2.46          385
## 10 MIL                1.95          386
## 11 PHX                1.76          386
## 12 DEN                1.07          386
## 13 BKN                0.899         385
## 14 NYK                0.357         379
## 15 OKC               -0.398         385
## 16 POR               -1.90          387
## 17 IND               -1.93          386
## 18 CLE               -2.57          378
## 19 MEM               -2.84          386
## 20 CHI               -2.94          378
## 21 UTA               -3.56          385
## 22 MIN               -4.22          377
## 23 CHA               -4.92          378
## 24 ATL               -5.13          380
## 25 HOU               -6.53          385
## 26 WAS               -7.07          385
## 27 NOP               -7.54          385
## 28 ORL               -7.86          386
## 29 SAS              -11.0           384
## 30 DET              -15.5           379
library(Ckmeans.1d.dp)  # required by xgb.plot.importance
## Warning: package 'Ckmeans.1d.dp' was built under R version 4.4.3
# --- Variable importance ---
importance <- xgb.importance(model = xgb_model)

cat("\n=== XGBoost Variable Importance (by Gain) ===\n")
## 
## === XGBoost Variable Importance (by Gain) ===
print(importance)
##          Feature       Gain      Cover  Frequency
##           <char>      <num>      <num>      <num>
## 1:    opp_last20 0.38828819 0.31817192 0.25425669
## 2:          dist 0.32619155 0.37415631 0.34640149
## 3:        season 0.08223931 0.07104842 0.13006153
## 4: road_trip_day 0.08014238 0.09045837 0.10945772
## 5:     rest_days 0.06024628 0.10434518 0.09700959
## 6:          home 0.03329268 0.01720818 0.02632709
## 7:           b2b 0.02959962 0.02461162 0.03648591
# --- Plot importance ---
xgb.plot.importance(importance_matrix = importance, top_n = 10)

ANSWER 9:

  • Most Helped by Schedule: LAC (10.6 wins)
  • Most Hurt by Schedule: DET (-15.5 wins)

To estimate how much teams were helped or hurt by their schedules from 2019–20 through 2023–24, I trained an XGBoost model to predict the probability of winning each regular season game. The model included variables that take into account recent strength of opponent and schedule difficulty: whether the game was at home or away, whether it was the second night of a back-to-back (b2b), how deep into a road trip the team was (road_trip_day), the number of rest days before the game (rest_days), travel distance (dist), and the opponent’s recent form over their last 20 games (opp_last20). Season indicators were included to allow for year-to-year baseline changes.

After training, I constructed two counterfactual scenarios: (1) opponent-neutralized, where every opponent was treated as a “.500 team” (10–10 in their last 20), and (2) schedule-neutralized, where every team’s back-to-backs, rest days, and travel were reset to neutral league-median values. Comparing the model’s predicted win probabilities under the actual schedule and the counterfactual schedules provides estimates of how much each team gained or lost due to opponent strength and schedule density. Summing across games gives a total wins added/lost from scheduling factors over the five-year span.